Welcome to the third practical on text mining!
The aim of this practical is to deepen your understanding of sentiment
analysis and to learn three different ways of performing it.
In this practical, we will focus on the following methods:
In this practical, we make use of the following packages:
library(tm)
library(text2vec)
library(tidyverse)
library(tidytext)
library(ggplot2)
library(caret)
library(rpart)
library(rpart.plot)

We are going to use one data set, movie_review, in this practical. It
comes with the text2vec package. This data set consists of 5000 IMDB
movie reviews, specially selected for sentiment analysis. The sentiment
of the reviews is binary: an IMDB rating < 5 results in a sentiment
score of 0, and a rating >= 7 results in a sentiment score of 1. No
individual movie has more than 30 reviews. Load this data set and
convert it to a data frame.

# load an example dataset from text2vec
data("movie_review")
# store the converted tibble, then print it
movie_review <- as_tibble(movie_review)
movie_review

The tidytext package gives access to four general-purpose sentiment
lexicons through the get_sentiments function:

AFINN: list of English words rated for valence with an integer between
  -5 and +5
bing: list of English words labelled as positive or negative
nrc: list of English words and their associations with 8 emotions
  (anger, fear, anticipation, trust, surprise, sadness, joy, and
  disgust) and 2 sentiments (negative and positive); binary
loughran: list of sentiment words for accounting and finance by
  category (Negative, Positive, Uncertainty, Litigious, Strong Modal,
  Weak Modal, Constraining)

We will use the bing lexicon in this practical.
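To make the difference between a categorical lexicon such as bing and a
numeric one such as AFINN more concrete, the sketch below scores one
tokenized sentence with two miniature lexicons. Both toy_bing and
toy_afinn are hypothetical: their entries and values are made up for
illustration and are not copied from the real lexicons.

```r
library(dplyr)

# Hypothetical miniature lexicons (NOT the real bing / AFINN entries)
toy_bing <- tibble(
  word      = c("good", "great", "bad", "awful"),
  sentiment = c("positive", "positive", "negative", "negative")
)
toy_afinn <- tibble(
  word  = c("good", "great", "bad", "awful"),
  value = c(2, 3, -2, -3)
)

# One tokenized toy sentence, one row per word
tokens <- tibble(
  word = c("a", "great", "cast", "but", "an", "awful", "awful", "script")
)

# Categorical lexicon: count how many matched words carry each label
bing_score <- tokens %>%
  inner_join(toy_bing, by = "word") %>%
  count(sentiment)

# Numeric lexicon: sum the valence of the matched words
afinn_score <- tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  summarise(score = sum(value))

bing_score
afinn_score
```

With a categorical lexicon every matched word counts equally, whereas a
numeric lexicon lets strongly worded terms weigh more heavily.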
Using the get_sentiments function, load the “bing” dictionary and store
it in an object called bing_sentiments.

bing_sentiments <- get_sentiments("bing")
bing_sentiments

Use the unnest_tokens function from the tidytext package to break the
text into individual tokens (a process called tokenization), and use the
head function to see its first several rows.

# tokenize the reviews
tidy_review <- movie_review %>%
  unnest_tokens(word, review) %>%
  # drop the original label so that it does not clash with the
  # sentiment column of the lexicon in the join that follows
  select(-sentiment)
head(tidy_review)

Use the inner_join function to find a sentiment score for each of the
tokenized review words using the bing lexicon (i.e., bing_sentiments).

review_sentiment <- tidy_review %>%
  inner_join(bing_sentiments, by = "word")
head(review_sentiment)

Count the number of positive and negative words for each of the review
ids. Then, compute the net sentiment score by subtracting the count of
negative words from the count of positive words.
Hint: You can use the count function from the dplyr package.
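Before running this on all reviews, it can help to trace the join, count,
and reshape steps on a tiny example. The sketch below uses hypothetical
toy data (toy_tokens and toy_lexicon are made up and only mimic the shape
of the tokenized reviews and of the bing lexicon).

```r
library(dplyr)
library(tidyr)

# Hypothetical tokens for three toy reviews, one row per word
toy_tokens <- tibble(
  id   = c("r1", "r1", "r1", "r2", "r2", "r3"),
  word = c("great", "fun", "boring", "awful", "plot", "movie")
)

# Hypothetical miniature lexicon with the same two columns as bing
toy_lexicon <- tibble(
  word      = c("great", "fun", "boring", "awful"),
  sentiment = c("positive", "positive", "negative", "negative")
)

toy_scores <- toy_tokens %>%
  # words that are not in the lexicon are dropped by the inner join
  inner_join(toy_lexicon, by = "word") %>%
  # number of positive / negative words per review
  count(id, sentiment) %>%
  # one row per review; a missing category becomes 0 instead of NA
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # net sentiment: positive words minus negative words
  mutate(sentiment_score = positive - negative)

toy_scores
```

Two details are worth noticing. Review r2 has no positive word, so
without values_fill = 0 its positive count, and therefore its score,
would be NA. Review r3 contains no lexicon word at all, so it disappears
from the result entirely.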
review_sentiment <- review_sentiment %>%
  # count the sentiment (positive/negative) per id
  count(id, sentiment) %>%
  # wide-format
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # compute the sentiment score
  mutate(sentiment_score = positive - negative)

Plot the sentiment score for each of the review ids.
Hint: Map id onto the x-axis and sentiment_score onto the y-axis.
review_sentiment %>%
  mutate(color = ifelse(sentiment_score > 0, "pos", "neg")) %>%
  ggplot(aes(x = reorder(id, -sentiment_score), y = sentiment_score, fill = color)) +
  geom_col() +
  labs(x = "id") +
  theme_classic() +
  theme(axis.text.x = element_blank())

Compare the dichotomized sentiment scores with the original sentiment
labels of the reviews.
Hint: You can use the table function.
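Note that table cross-tabulates two vectors position by position, so the
labels and the predictions have to refer to the same review in every
row. A minimal sketch of one way to guarantee this is to join the two
tables on id first. The data below are hypothetical (toy_truth and
toy_pred are made up); the predictions are sorted by id, as the output
of count would be, while the labels are not.

```r
library(dplyr)

# Hypothetical true labels, in the (unsorted) order of the raw data
toy_truth <- tibble(
  id        = c("b", "c", "a", "d"),
  sentiment = c(1, 0, 1, 0)
)

# Hypothetical predictions, sorted by id
toy_pred <- tibble(
  id              = c("a", "b", "c", "d"),
  dicho_sentiment = c(1, 1, 0, 0)
)

# Comparing by position pairs labels and predictions of different reviews
positional_accuracy <- mean(toy_truth$sentiment == toy_pred$dicho_sentiment)

# Joining on id pairs each prediction with the label of the same review
toy_compare <- toy_truth %>%
  inner_join(toy_pred, by = "id")
joined_accuracy <- mean(toy_compare$sentiment == toy_compare$dicho_sentiment)

confusion <- table(true = toy_compare$sentiment,
                   predicted = toy_compare$dicho_sentiment)
confusion
```

In this toy example the predictions are all correct, yet the positional
comparison reports only half of them as correct. The inner join also
drops ids that are missing from either table, so no separate filtering
step is needed.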
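# Dichotomize the sentiment score to match with the original sentiment scores
review_sentiment <- review_sentiment %>%
  # 0: net score lower than five, 1: net score of five or higher
  mutate(dicho_sentiment = ifelse(sentiment_score < 5, 0, 1))
## reviews without any word from the bing lexicon were dropped by
## inner_join, so 5 of the 5000 reviews have no score;
## keep only the reviews that were scored
movie_review <- movie_review %>%
  filter(id %in% review_sentiment$id)
## the agreement is poor; note that table() pairs the two columns by
## position, so both objects need to list the reviews in the same order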
table(true = movie_review$sentiment, predicted = review_sentiment$dicho_sentiment)

    predicted
true    0    1
   0 1830  651
   1 1811  703
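To put a number on how well the two columns agree, the accuracy can be
computed from the table above as the share of reviews on the diagonal.
The sketch below simply re-enters the four reported counts by hand; the
object names are chosen for illustration.

```r
# Confusion matrix copied from the output above
# (rows: true sentiment, columns: predicted sentiment)
confusion <- matrix(
  c(1830,  651,
    1811,  703),
  nrow = 2, byrow = TRUE,
  dimnames = list(true = c("0", "1"), predicted = c("0", "1"))
)

# Accuracy: correctly classified reviews divided by all reviews
accuracy <- sum(diag(confusion)) / sum(confusion)
round(accuracy, 3)                 # 0.507

# Share of reviews that were predicted to be positive
share_predicted_positive <- sum(confusion[, "1"]) / sum(confusion)
round(share_predicted_positive, 3) # 0.271
```

An accuracy of about 0.51 on two classes of nearly equal size is close
to what random guessing would achieve.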